AITopics | Casper

Casper

PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

Anderson, Carolyn Jane, Biswas, Joydeep, Boruch-Gruszecki, Aleksander, Cassano, Federico, Feldman, Molly Q, Guha, Arjun, Lucchetti, Francesca, Wu, Zixuan

arXiv.org Artificial Intelligence, Feb-6-2025

Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output, and in rare cases it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.

benchmark, large language model, machine learning, (17 more...)

2502.01584

Country:

South America (0.04)
Oceania > Australia (0.04)
North America > United States > Wyoming > Natrona County > Casper (0.04)
(10 more...)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection

Veyseh, Amir Pouran Ben, Van Nguyen, Minh, Dernoncourt, Franck, Nguyen, Thien Huu

arXiv.org Artificial Intelligence, Nov-17-2022

Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text. Despite considerable research efforts in recent years for English text, the task of ED in other languages has been significantly less explored. For non-English languages, important research questions for ED include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages. To answer those questions, it is crucial to obtain multilingual ED datasets that provide consistent event annotation for multiple languages. There exist some multilingual ED datasets; however, they tend to cover a handful of languages and mainly focus on popular ones. Many languages are not covered in existing multilingual ED datasets. In addition, the current datasets are often small and not accessible to the public. To overcome those shortcomings, we introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages; 5 of them have not been supported by existing multilingual datasets. We also perform extensive experiments and analysis to demonstrate the challenges and transferability of ED across languages in MINION, which together call for more research effort in this area.

artificial intelligence, machine learning, natural language, (19 more...)

2211.05958

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Oregon > Lane County > Eugene (0.14)
North America > Dominican Republic (0.04)
(16 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Government > Military (1.00)
Government > Regional Government > North America Government > United States Government (0.93)
Transportation (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Janice: Excited for eclipse

FOX News, Aug-19-2017, 20:22:37 GMT

I was 8 years old and remember being both terrified and intrigued about something that was being talked about everywhere. This wasn't a storyline out of a science fiction movie or novel; this was real, and happening here on Earth. Millions of people were going to witness something that maybe happens a couple of times in our lifetime: a total solar eclipse. Our teachers were planning lessons about this incredible celestial event. Chalkboard diagrams, planetary mobiles and handmade viewing devices were being created out of shoe boxes.

artificial intelligence, eclipse, science fiction, (11 more...)

Country:

North America > United States > Missouri > Jackson County > Kansas City (0.15)
North America > United States > South Carolina > Greenville County > Greenville (0.06)
North America > United States > Wyoming > Natrona County > Casper (0.05)
(6 more...)

Industry: Media > Film (0.50)

Technology: Information Technology > Artificial Intelligence > Science Fiction (0.50)
